Using Residual Analyses to Assess Item
Response Model-Test Data Fit



Linda N. Murray and Ronald K. Hambleton
University of Massachusetts, Amherst 



Abstract 



The purpose of this research study was to assess item response
model-test data fit using residuals. First, a comparison of raw and
standardized residuals for describing model-test data fit was carried
out. Second, hypotheses concerning the relationship between residual
sizes and several item characteristics were studied. The analyses
with residuals were carried out with NAEP mathematics test data using
the one-, two-, and three-parameter logistic test models. The results
from the investigation clearly highlighted the advantages of
addressing the question of model-test data fit with residuals.




Presently, there is considerable interest in applying the one-,
two-, and three-parameter logistic item response models to a wide
variety of educational and psychological measurement areas. These
areas include detection of item bias, adaptive testing, mastery
testing, item banking, test development, and test score equating
(Hambleton, 1983; Lord, 1980; Traub & Wolfe, 1981). However, the
benefits of item response theory are predicated upon an adequate fit
between the chosen model and the set of test data. Clearly, no
psychologically meaningful test model can ever fit a data set
perfectly. But without sufficient model-test data fit, the desirable
features of an item response model will not be obtained, or will be
obtained only to a low degree.

Goodness of fit studies are helpful in assessing the utility of
an item response model for solving specific measurement problems with a
particular test data set. Hambleton, Murray and Simon (1982) organized
and reviewed many goodness of fit procedures that have been advocated
and documented in the research literature. The procedures they found
can be grouped into several general categories. These categories
include (1) statistical tests for assessing model-data fit, (2)
verifying model assumptions and expected model features, and (3)
checking model predictions with test results. It was determined that
these procedures varied substantially in their level of practicality
and effectiveness. For example, they found that much attention was
focused on the use of statistical tests where, unfortunately, model-data
fit depended upon the sizes of examinee samples used in the studies.
The statistical values could become significant due principally to
large sample sizes (Hambleton, Murray & Simon, 1982).

Analyses of residuals offer another means of examining model-data
fit. These analyses are more practical than many of the other fit
methods and they often provide a more effective way of revealing
instances or patterns of misfit (Traub & Wolfe, 1981). Residual
analyses have played an important role in determining the suitability
of regression models (Draper & Smith, 1966; Anscombe & Tukey, 1963;
Seber, 1977). On the other hand, residual analyses have not been used
to any substantial extent to investigate the appropriateness of item
response models.

A residual analysis involves the following steps: (1) a model is
chosen and model parameters are estimated from the data; (2) the
estimates are substituted into the model and predictions are made; and
(3) discrepancies (residuals) between the data and values predicted by
the model are examined. The overall quality and suitability of the
model and the usability of the results are evaluated by examining the
size and direction of the residuals and variations such as
absolute-valued residuals. Sometimes the residuals are plotted as a
function of ability to determine more precisely the nature of
model-test data misfit.

In one recent study, Hambleton and Murray (1983) examined the
size and pattern of standardized residuals using the one-parameter and
three-parameter logistic item response models. They also explored the
relationship between selected item characteristics, such as content
categories and item format, and the size of standardized residuals.
Overall, their research study revealed that residual analyses helped
considerably in judging the suitability of the two item response
models.

The purpose of this research study was to expand on the earlier
residual analysis work of Hambleton, Murray and Simon (1982) and
Hambleton and Murray (1983). More specifically, this research
investigation was designed to address two topics:

1. Comparison of raw and standardized residuals for describing
model-data fit.
2. Hypotheses concerning the relationship between fit of test
items and item format, difficulty level, discrimination
level, item wording, and various other salient aspects of
test items.

With respect to the first topic, this study extended the earlier work
by Hambleton and Murray (1983) by reporting raw residuals and, in
addition, compared raw and standardized residuals for the purpose of
describing model-data fit. With respect to the second topic, this
study investigated the fit of three logistic models rather than two
models and considered several additional hypotheses which were not
examined in the earlier study.



Method 

Description of the Tests 

Four National Assessment of Educational Progress (NAEP) test
booklets from the 1977-78 assessment were selected for analysis:

9 Year Olds
Booklet No. 1, 65 items, 2495 examinees
Booklet No. 2, 75 items, 2463 examinees

13 Year Olds
Booklet No. 1, 58 items, 2422 examinees
Booklet No. 2, 62 items, 2433 examinees



Each booklet contained test items measuring various mathematical skills
in the areas of definitions, story problems, geometry, measurement, and
graphs and figures. The test items in the NAEP assessment were either
multiple-choice or open-ended. Finally, these data sets were unusual
in the sense that the test items varied substantially in both their
range of difficulty (.02 to .98) and their range of item discrimination
levels (-.01 to .99). These ranges far exceed those normally
found in achievement and aptitude tests. Because of the wide range of
classical item discrimination indices and the high level of guessing
due to the substantial number of difficult items, we expected that the
three-parameter model would fit the test data substantially better than
the other two more restrictive models.

Residual Analyses

Each analysis in this study began with the calculation of the raw
and standardized residuals. Raw residuals are comparisons of predicted
performance results with actual performance results. To calculate
residuals, an item response model was first chosen. For this study the
one-, two-, and three-parameter logistic test models were used in
separate but identical analyses. Next, item and ability parameter
estimates were obtained using the LOGIST computer program (Wood,
Wingersky & Lord, 1976). To find the actual performance results, an
examinee was placed in an ability category based on his or her
estimated ability level. For this investigation, ability categories
were chosen that divided the ability scale between -3.0 and 3.0 into 12
equal intervals. Ability estimates that fell beyond these maximum and
minimum ability levels were deleted from the analysis. In each
analysis, this usually involved fewer than 10 cases. For each of the 12
ability categories, the average observed performance (P_ij) for an item
i in ability category j was found. For example, if 10 of 50 examinees
in ability category j answered item i correctly, then P_ij would be .2.
The process was repeated for each ability category (j = 1, 2, ..., 12)
and for each item (i = 1, 2, ..., n) in a test booklet.
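
The grouping step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the study: the function name `observed_proportions`, the handling of interval boundaries, and the toy data are all hypothetical.

```python
def observed_proportions(thetas, responses, lo=-3.0, hi=3.0, n_bins=12):
    """Observed proportion correct on one item within each ability category.

    thetas    -- estimated abilities, one per examinee
    responses -- 0/1 item scores in the same order
    Returns (counts, proportions); a proportion is None for an empty category.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    correct = [0] * n_bins
    for theta, u in zip(thetas, responses):
        if theta < lo or theta > hi:
            continue  # estimates outside [-3, 3] are dropped from the analysis
        # Assumption: intervals are closed on the left; theta == hi joins the top one.
        j = min(int((theta - lo) / width), n_bins - 1)
        counts[j] += 1
        correct[j] += u
    proportions = [c / n if n else None for c, n in zip(correct, counts)]
    return counts, proportions


# Toy data mirroring the example in the text: 10 of 50 examinees answer correctly.
thetas = [0.1] * 50 + [3.4]            # the last estimate lies outside the scale
responses = [1] * 10 + [0] * 40 + [1]
counts, proportions = observed_proportions(thetas, responses)
print(counts[6], proportions[6])
```

With these toy data the category from 0.0 to 0.5 holds the 50 retained examinees and an observed proportion of .2, while the out-of-range estimate is excluded.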

Using the midpoint of each ability category (i.e., -2.75, -2.25,
-1.75, ..., +2.75) as the average ability level for that
group of examinees, the expected performance (E_ij) for item i in
ability category j was found in the usual way:

$$E_{ij}^{(3)} = c_i + (1 - c_i)\,\frac{e^{1.7a_i(\theta_j - b_i)}}{1 + e^{1.7a_i(\theta_j - b_i)}}$$

for the three-parameter logistic model,

$$E_{ij}^{(2)} = \frac{e^{1.7a_i(\theta_j - b_i)}}{1 + e^{1.7a_i(\theta_j - b_i)}}$$

for the two-parameter logistic model, and

$$E_{ij}^{(1)} = \frac{e^{(\theta_j - b_i)}}{1 + e^{(\theta_j - b_i)}}$$

for the one-parameter logistic model.

In these equations a_i, b_i, and c_i are the item parameter estimates
obtained from LOGIST (Lord, 1980) and θ_j is the midpoint of the jth
ability category.
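
As a rough sketch of this step, the expected performance at each category midpoint can be computed with one logistic function. The names `expected_performance` and `category_midpoints` and the item parameter values are assumptions introduced here for illustration. Setting c = 0 gives the two-parameter form; for the one-parameter form a single common slope is passed, and whether the 1.7 scaling applies is left to the caller through the `scale` argument.

```python
import math


def expected_performance(theta, b, a=1.0, c=0.0, scale=1.7):
    """Logistic item response function evaluated at ability theta.

    c + (1 - c) * e^z / (1 + e^z), with z = scale * a * (theta - b).
    """
    z = scale * a * (theta - b)
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def category_midpoints(lo=-3.0, hi=3.0, n_bins=12):
    """Midpoints of n_bins equal-width ability categories on [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + width * (j + 0.5) for j in range(n_bins)]


# Hypothetical item parameter estimates, for illustration only.
a_i, b_i, c_i = 1.2, 0.5, 0.2
midpoints = category_midpoints()
expected = [expected_performance(t, b_i, a_i, c_i) for t in midpoints]
for theta_j, e_ij in zip(midpoints, expected):
    print(f"{theta_j:+.2f}  {e_ij:.3f}")
```

The expected values rise with ability and never fall below the lower asymptote c_i, which is the behavior the three-parameter model is meant to capture for items on which low-ability examinees can guess.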
Then the raw residual (R_ij) for item i in ability category j was

$$R_{ij} = P_{ij} - E_{ij}.$$

This difference is an index of the degree of misfit between the
test data and the expected item performance based on the chosen
model. Large positive raw residuals indicate that examinees are
performing considerably better on an item than is predicted by the item
response model. Large negative raw residuals reveal that the model is
predicting a much higher performance level by the examinees on the item
than is actually observed. Finally, evidence of sufficient model-data
fit occurs when the residuals are small and there are no obvious
patterns in the residuals across ability levels.

Next, these raw residuals were transformed to standardized
residuals (SR_ij) by dividing R_ij by the sampling error associated with
the average expected performance level in an ability category (Blalock,
1979). That is,

$$SR_{ij} = \frac{P_{ij} - E_{ij}}{\sqrt{E_{ij}(1 - E_{ij})/N_j}}$$

where N_j is the number of examinees in ability category j.

These raw and standardized residuals differ in several ways. Raw
residuals are simpler to calculate and easier to interpret than
standardized residuals. On the other hand, standardized residuals take
into account the sampling errors associated with P_ij. When N_j is
small, other things being equal, big differences between actual and
expected performance must be obtained for the differences to be taken
as an indication of model-test data misfit. For example, suppose two
different ability categories for an item i have the same computed raw
residual (.3 - .2), but differ in their examinee sample sizes (10 vs
100). Using the raw residuals, it appears that model-data fit is the
same in both examinee samples. But the greater number of examinees
(N_j) produces a smaller standard error of expected performance level
because a more accurate estimate is possible. Then, the corresponding
standardized residuals are .79 and 2.5. Clearly, the two statistics
seem to give a very different picture of model-data fit. Therefore, a
comparison of raw and standardized residuals was made to determine how
differently they described levels of model-data fit and whether the
choice of statistic might affect the decision about the usefulness of
item response models. The size and direction of the raw and
standardized residuals in the analyses were compared in three ways:
(1) across ability levels for each item; (2) across items at each
ability level; and (3) across both ability levels and test items.
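
The worked example above (observed .3, expected .2, with 10 versus 100 examinees) can be checked with a short sketch. It assumes the sampling error is the binomial standard error of the expected performance level, which is the reading that reproduces the reported values of .79 and 2.5; the function names are hypothetical.

```python
import math


def raw_residual(p_obs, e_exp):
    """Observed minus expected proportion correct."""
    return p_obs - e_exp


def standardized_residual(p_obs, e_exp, n):
    """Raw residual divided by the standard error of the expected proportion."""
    standard_error = math.sqrt(e_exp * (1.0 - e_exp) / n)
    return raw_residual(p_obs, e_exp) / standard_error


for n in (10, 100):
    sr = standardized_residual(0.3, 0.2, n)
    print(n, round(raw_residual(0.3, 0.2), 2), round(sr, 2))
```

The raw residual is the same in both cases, while the standardized residual grows with the square root of the sample size.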

Research Hypotheses 

Several testable research hypotheses were generated concerning
model-data fit. Specifically, interest centered on determining if test
items having large positive or negative standardized residuals exhibit
certain salient item characteristics that would cause them to be misfit
by an item response model. To reduce problems associated with studying
curvilinear relationships, absolute-valued standardized residuals were
used instead of standardized residuals. Then, analyses were conducted
concerning the association between the fit of test items and item
format, and classical indices of item difficulty and discrimination.

Comparison of Raw and Standardized Residuals

Table 1 displays the intercorrelations among several of the NAEP
math item variables. There is a strong relationship between the
one-parameter raw and standardized residuals (r = .91), suggesting they
describe model-data fit in similar fashions. The correlations between
the two-parameter and three-parameter raw residuals with their
corresponding standardized residuals are lower (r = .77). But these
correlations are probably lower only because of range restriction on the
variables, as shown by the standard deviations listed in Table 1.

Absolute-valued raw and standardized residuals for each of the
logistic models are similarly correlated with difficulty, item format,
and item order. Because of the non-linear relationship, associations
between item discrimination (as measured by biserial correlations)
and the residuals were investigated by examining the plots shown in
Figures 1 through 6. Figures 1 and 2 are plots of raw residuals and
standardized residuals versus classical item discrimination indices.
These figures show clearly that for the one-parameter model, a
curvilinear relationship prevailed whether raw or standardized
residuals were used to describe fit (i.e., very low or high
discriminating items had larger residuals with the one-parameter
model). Small differences between the results in these plots emerged
Table 1

Statistics of and Intercorrelations Among Several NAEP Math Item Variables
(Booklet Nos. 1 and 2, 260 Items, 13 and 9 Year Olds, 1977-78)

                                       Standard
Variable                        Mean  Deviation  |SR(2-P)| |SR(3-P)| |RR(1-P)| |RR(2-P)| |RR(3-P)|         P         F         O

|Standardized Residual (1-P)|   1.98       1.20        .24       .18       .91       .35       .33      -.30      -.25       .14
|Standardized Residual (2-P)|   1.01        .42                  .41       .08       .77       .30      -.21       [?]       .00
|Standardized Residual (3-P)|    .88        .42                            .15       .27       .77       .09       .07      -.03
|Raw Residual (1-P)|            .060       .033                                      .24       .34      -.17      -.19       .09
|Raw Residual (2-P)|            .033       .017                                                .43      -.22      -.34       .13
|Raw Residual (3-P)|            .030       .017                                                         -.07      -.17       .14
Item Difficulty (P)              .53        .27                                                                    .04      -.40
Format (F)(1)                                                                                                               -.12
Item Order (O)

(1) Two types: multiple-choice and open-ended.
[?] Value illegible in the source.
Figure 1. Plot of one-parameter model raw residuals versus item
discrimination (biserial correlation).

Figure 2. Plot of one-parameter model standardized residuals versus
item discrimination (biserial correlation).

Figure 3. Plot of two-parameter model raw residuals versus item
discrimination (biserial correlation).

Figure 4. Plot of two-parameter model standardized residuals versus
item discrimination (biserial correlation).

Figure 5. Plot of three-parameter model raw residuals versus item
discrimination (biserial correlation).

Figure 6. Plot of three-parameter model standardized residuals versus
item discrimination (biserial correlation).



for lower discriminating items. Similarly, Figures 3 through 6 display
the plots of the residuals versus item discrimination for the two- and
three-parameter models. These plots again suggest strong agreement
between the residuals except for low discriminating items, where a
slightly wider variation of misfit was found with the raw residuals.

Next, a check on the degree of similarity between raw and
standardized residuals was carried out with the one-parameter model
results. Using 2.0 as the cut-off point on the absolute-valued
standardized residual scale, 102 "bad" items were identified. Next,
the poorest fitting 102 items on the absolute-valued raw residual
scale were identified. Ninety percent of the items were common to the
two analyses, indicating a high level of agreement in the identification
of misfitting items. (Were agreement due to chance factors only, about
15% of the items would have been common to the two analyses.) Because
of the small number of misfitting items with the two- and three-parameter
models, similar analyses with these models were not carried out.
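
The agreement check can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the function name `flag_overlap`, the strict inequality at the cut-off, and the four-item toy data are assumptions.

```python
def flag_overlap(abs_sr, abs_rr, cutoff=2.0):
    """Compare items flagged by |SR| > cutoff with the same number of worst |RR| items.

    abs_sr, abs_rr -- per-item absolute-valued standardized and raw residuals
    Returns (number flagged, number in common, proportion in common).
    """
    flagged_sr = {i for i, v in enumerate(abs_sr) if v > cutoff}
    k = len(flagged_sr)
    # Take the k items with the largest absolute-valued raw residuals.
    ranked = sorted(range(len(abs_rr)), key=lambda i: abs_rr[i], reverse=True)
    worst_rr = set(ranked[:k])
    common = flagged_sr & worst_rr
    return k, len(common), (len(common) / k if k else None)


# Hypothetical residual summaries for four items.
abs_sr = [2.5, 0.5, 3.0, 1.0]
abs_rr = [0.09, 0.08, 0.02, 0.01]
print(flag_overlap(abs_sr, abs_rr))
```

In the toy data two items exceed the cut-off and one of them is also among the two worst raw residuals, so the agreement is one half; in the study the corresponding figure was ninety percent of 102 items.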

The averages of raw and standardized residuals and of their
absolute values at 12 ability levels with the three logistic models are
reported in Table 2. The average raw and standardized residual
statistics provide information about the size and direction of the
misfit between the observed and expected results, while the
absolute-valued statistics ignore the direction of misfit and consider
only the magnitude of the misfit. Since the trends in the results
across the four Math booklets were the same, only the results for one
booklet are reported in this paper.

Three of the four statistics in Table 2 present a similar picture 
of fit for the three item response models. Both the two- and 
Table 2

Average and Absolute Average Raw and Standardized Residuals at Twelve Ability Levels
with the One-, Two-, and Three-Parameter Logistic Models
(Booklet No. 1, 9 Year Olds, 65 Items, 1977-78)

Logistic Sample                                    Ability Level
Model    Size     -2.75  -2.25  -1.75  -1.25   -.75   -.25    .25    .75   1.25   1.75   2.25   2.75   Total (unweighted)

Number of examinees
1        2495        27     43    111    220    331    485    446    395    276    122     21      8
2        2495        12     49    110    231    379    466    466    349    273     99     39     15
3        2495        29     50    108    212    333    470    470    403    273    100     21      9

Average raw residual
1                  .002   .001  -.001   .001   .002   .002   .002  -.006  -.009  -.013  -.003  -.005   -.002
2                  .004   .005  -.017   .009  -.003  -.003  -.004  -.006  -.001   .003   .005   .031    .006
3                  .004   .010   .010   .003   .001   .001   .002  -.002  -.005  -.012  -.005   .006   -.001

Average absolute-valued raw residual
1                   [?]   .088   .074   .073   .045   .030   .027   .043   .057   .076   .071   .084    .061
2                  .052   .048   .042   .021   .017   .018   .013   .018   .017   .033   .038   .075    .033
3                  .049   .040   .034   .019   .020   .015   .010   .013   .015   .025   .043   .073    .030

Average standardized residual
1                   .77    .99    .89    .79    .37    .20    .14   -.28   -.26   -.39   -.11   -.10     .25
2                   .09    .31    .76    .35    .09   -.22   -.30   -.37   -.18   -.06   -.02   -.22     .06
3                   .00    .24    .27    .12    .16    .04    .08   -.18   -.48   -.36   -.32   -.16     .05

Average absolute-valued standardized residual
1                  1.75   2.40   2.82   3.35   2.35   1.80   1.62   2.35   2.64   2.40   1.19    .85    2.13
2                   .82   1.28   1.58   1.00    .90   1.15    .83   1.03    .93   1.12    .97   1.07    1.06
3                   .81    .90   1.02    .74   1.00    .94    .62    .87    .99    .85    .91    .88     .88

[?] Value illegible in the source.
three-parameter models provided a very good accounting of the actual
results. The one-parameter model did not. The fourth statistic
(average raw residuals) described model-data fit much differently.

Discrepancies between these two impressions of model-data fit can
be accounted for by examining the way in which average raw residuals
are computed. A comparison of the size and direction of model-data
misfit between the one- and three-parameter models for one ability
category (-2.00 to -1.50) is shown in Table 3. The direction of misfit
can be either positive or negative, depending on whether the model has
underpredicted or overpredicted examinee performance. As can be seen
from Table 3, a considerable amount of misfit in both directions
occurred with the one-parameter model. This finding was not surprising
since it was already noted that the items varied substantially in
levels of item discrimination. The one-parameter model assumed a
common item discrimination across the set of items. But because there
was considerable deviation from this average item discrimination, the
results were (1) large residuals in both directions and (2) a
very small overall average raw residual.
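
The cancellation can be illustrated with a small sketch. The four-item residual list is invented for illustration. The second part assumes that the two sizes of misfit reported in Table 3 are the summed positive and summed negative raw residuals over the 65 items of the booklet; under that assumption the tabled averages of -.001 and .010 are reproduced.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical raw residuals for four items in one ability category:
# large misfit in both directions, yet a signed average of zero.
residuals = [0.08, -0.08, 0.05, -0.05]
signed_average = mean(residuals)
absolute_average = mean([abs(r) for r in residuals])
print(signed_average, round(absolute_average, 3))


def average_from_directional_sums(positive_sum, negative_sum, n_items):
    """Signed average implied by summed misfit in each direction (assumed layout)."""
    return (positive_sum - negative_sum) / n_items


# Sizes of misfit from Table 3, ability category -2.00 to -1.50, 65 items.
one_p = average_from_directional_sums(2.385, 2.434, 65)
three_p = average_from_directional_sums(1.427, 0.795, 65)
print(round(one_p, 3), round(three_p, 3))
```

The signed average hides misfit that the absolute-valued average keeps, which is why the average raw residual alone makes the one-parameter model look no worse than the others.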

Hypothesis Testing 

The results in Tables 4 through 6 suggest reasons for model-test
data misfit. Table 4 displays the results from an analysis of the
relationship between the size of the standardized residuals and the
level of classical item difficulty. Substantial improvement in fit
occurred for hard items when the three-parameter model was fit to the
test data. For easier items better fits were obtained again by the

Table 3

Comparison of the Size and Direction of Model-Data
Misfit for One Ability Category (-2.00 to -1.50)
(Booklet No. 1, 9 Year Olds, 1977-78)

Logistic      Size of Misfit                    Average
Model         (Reported in Each Direction)      Residual

1             2.385         2.434                -.001
3             1.427          .795                 .010
Table 4

Association Between Standardized Residuals
and Item Difficulties
(Booklets No. 1 and 2, 260 Items, 9 and 13 Year Olds, 1977-78)

Difficulty       Standardized     1-p Results     2-p Results     3-p Results
Level            Residuals          N      %        N      %        N      %

Hard (p ≤ .5)    |SR| ≤ 1.0         14     11       69     56       99     80
                 |SR| > 1.0        110     89       55     44       25     20

Easy (p > .5)    |SR| ≤ 1.0         34     33       87     64       98
                 |SR| > 1.0        102     67       49     36       38     18
Table 5

Descriptive Statistical Analysis of the Absolute-Valued Standardized Residuals
(Booklets No. 1 and 2, 260 Items, 9 and 13 Year Olds, 1977-78)

Difficulty                         Number      1-p Results     2-p Results     3-p Results
Level            Format            of Items    Mean     SD     Mean     SD     Mean     SD

Hard (p ≤ .5)    Multiple-Choice      70       2.73    1.55    1.18     .53     .82     .23
                 Open-Ended           54       1.64     .81     .92     .38     .86     .28

Easy (p > .5)    Multiple-Choice      70       1.79    1.10     .94     .40     .90     .64
                 Open-Ended           66       1.67     .72     .97     .30     .97     .38
Table 6

Relationship Between Item Discrimination Indices
and Standardized Residuals
(Booklets No. 1 and 2, 260 Items, 9 and 13 Year Olds, 1977-78)

                                        Discrimination Indices
         Standardized     -.01 to .30   .31 to .50   .51 to .70   .71 to 1.00
Model    Residuals           (29)*         (55)        (125)         (51)

1-p      0.00 to 1.00          0.0         10.9         33.6          0.0
         1.01 to 2.00          0.0         32.7         62.4         29.4
         over 2.00           100.0         56.4          4.0         70.6
         χ² = 143.7     d.f. = 6     p = .000     Eta = .691

2-p      0.00 to 1.00         51.7         49.1         60.8         74.5
         1.01 to 2.00         41.4         41.8         36.0         25.5
         over 2.00             6.9          9.1          3.2          0.0
         χ² = 11.58     d.f. = 6     p = .072     Eta = .203

3-p      0.00 to 1.00         75.9         80.0         76.8         68.6
         1.01 to 2.00         20.7         18.2         23.2         29.4
         over 2.00             3.4          1.8          0.0          2.0
         χ² = 5.28      d.f. = 6     p = .508     Eta = .092

* Number of test items in brackets. Entries are percentages of the
items in each discrimination category.
three-parameter model, although there was a less dramatic shift in fit
between the two- and three-parameter models. These findings suggest
that examinee guessing was an important factor with the harder items
and less consequential with easier items.

Table 5 provides a summary of the absolute-valued standardized
residuals for the three logistic models with items classified by
difficulty and format. For both hard and easy open-ended items and
easy multiple-choice items the pattern of results was the same.
Substantial improvements in fit were obtained when the two-parameter
model was substituted for the one-parameter model. The two- and
three-parameter results, however, were similar. For the hard
multiple-choice items a substantially different pattern emerged. First,
the size of the standardized residuals was on the average substantially
larger for the one- and two-parameter models. Second, there were
considerable improvements in fit between the one- and two-, and the
two- and three-parameter models. This result strongly suggests that
examinee guessing on hard multiple-choice items affects the degree of
model-data fit and therefore the "pseudo-chance level" parameter was
useful.

Finally, Table 6 reveals the relationship between item
discrimination and standardized residual size. For these items varying
greatly in levels of item discrimination, the best fit occurred with
the three-parameter model. Items with relatively high or low item
discrimination indices were poorly fitted by the one-parameter model.
This resulted in a strong curvilinear relationship as represented by an
eta value of .691. Substantial improvement in fit occurred when the
two-parameter model replaced the one-parameter model.




The previous analyses presented results about trends of misfit
across a number of test items. Were there any specific reasons why
particular items misfit a certain model or models? To answer this
question, items and their corresponding standardized residuals with the
three models were scrutinized individually. Four different patterns
emerged: (1) substantial improvement in fit by using the two- or
three-parameter models, (2) similar fit across the three models, (3)
best degree of fit by using the three-parameter model, and (4) best
degree of fit by using the two-parameter model. For each pattern, a
representative item was examined carefully in order to identify
possible salient item characteristics causing these instances of misfit
and fit. Table 7 contains the results of this analysis. The four test
items are shown in Figure 7.



With Item 36, significant improvement in model-data fit occurred
when the two-parameter model replaced the one-parameter model. The
classical item statistics showed the item as being non-discriminating
(r = -.01) and difficult (p = .21) due, in part, to the unusual nature of
the test question (i.e., subtracting ranges of numbers) and the overlap
in the answer choices. With the two- and three-parameter models it was
possible to account for the very low discriminating power of the test
item. With the one-parameter model it was not, hence the poor
model-data fit.

Item 44 was fit by the three models in a similar fashion. The
classical item statistics reveal that the item had a middle level of
difficulty (p = .68) and discrimination (r = .59). The item had an
open-ended format and thus guessing was an inconsequential
consideration in item performance. Therefore the additional effort
made to incorporate "item discrimination" and "pseudo-guessing"
Table 7

Representative Items for Four Patterns of Model Misfit
(Math Booklet No. 1, 13 Year Olds, 1977-78)

Item
Number  |SR1|  |SR2|  |SR3|  Description                  Possible Explanation(s)

36      7.08   1.02   1.19   Substantial improvement      Unusual item wording; overlap
                             in fit by using the 2-P      of answer choices; non-
                             or 3-P models over the       discriminating and difficult
                             1-P model                    item

44      1.58   2.14   1.93   Similar fits for             Open-ended format; average
                             the models                   level of item discrimination

23      2.85   1.49    .71   Improvement in fit from      Multiple-choice format;
                             using the 3-P model          relatively difficult and
                             rather than the 1-P or       discriminating; substantial
                             2-P model                    amount of guessing

4       3.11    .94   1.94   Best fit from the            Open-ended format; extremely
                             2-P model                    discriminating; misfit of 3-P
                                                          model occurred at the highest
                                                          ability level due to a
                                                          highly unstable standardized
                                                          residual
Figure 7. Four sample test items.

36. Ms. Baker has between $8,000 and $8,500 in her savings account.
She wants to buy a new car that costs between $5,300 and $5,400.
After she buys the car, how much money will Ms. Baker have in her
savings account?

O $2,700

O $3,100

O Between $2,700 and $3,100

O Between $2,600 and $3,200

O I don't know.

44. Find the quotient.

A. 6)608 ANSWER ______

23. When is the product of two integers negative?

O When both are positive

O When both are negative

O When one is negative and one is positive

O When one is zero and one is negative

O I don't know.

4. What is the length of this pencil to the nearest quarter inch?

ANSWER ______ inches
parameters did not increase the amount of model-data fit.

For Item 23 considerable improvement in fit occurred when the
three-parameter model was substituted for the one- and two-parameter
models. This multiple-choice item was quite difficult (p = .36). Its
discrimination was moderate (r = .38) but substantially lower than the
average discriminating power of items in the test. The similarity in
the answer choices may have caused a considerable amount of guessing
even though "I don't know" was an answer alternative. Therefore the
three-parameter model accounted for the test data best.

Finally, with Item 4 a fourth pattern of misfit is revealed.
According to the size of the standardized residuals, the two-parameter
model fits the test data best. This item was very discriminating
(r = .81) and moderately difficult (p = .52). The high level of item
discrimination would explain improvements in fit by substituting the
two-parameter for the one-parameter model.

Figures 8 and 9 show the plots of standardized residuals versus
ability. These plots help explain why the two-parameter model appeared
to fit the data better than the three-parameter model. For the
examinees in the ability range between 2.50 and 3.00 the
three-parameter model over-predicted performance. But because of the
very small standard error due to the easiness of the test item for high
ability examinees, the standardized residuals "blew up." This
occurrence is observed with statistics such as the chi-square test when
expected values are very small.
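
A brief numerical sketch shows how this instability arises. The values below are hypothetical and are not taken from Item 4: the raw residual is held at -.05 with 10 examinees in the category, and only the expected proportion correct is moved toward 1. The standard error is assumed to be the binomial standard error of the expected proportion.

```python
import math


def standardized_residual(p_obs, e_exp, n):
    """Raw residual divided by the standard error of the expected proportion."""
    return (p_obs - e_exp) / math.sqrt(e_exp * (1.0 - e_exp) / n)


n_examinees = 10          # hypothetical size of the highest ability category
raw = -0.05               # the model over-predicts by the same amount each time
results = []
for expected in (0.90, 0.99, 0.999):
    sr = standardized_residual(expected + raw, expected, n_examinees)
    results.append(sr)
    print(f"E = {expected:.3f}  raw = {raw:+.2f}  SR = {sr:+.2f}")
```

As the expected proportion approaches 1 the denominator shrinks toward zero, so the same small raw discrepancy yields an ever larger standardized residual.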
Figure 8. Standardized residual plots
obtained with the two-parameter model for
Item 4.

Figure 9. Standardized residual plots
obtained with the three-parameter model for
Item 4.
Discussion

The results from this study showed that the statistics on average
raw and standardized residuals provided very useful fit information,
but that when compared, the statistics based on standardized residuals
presented a more accurate picture of model-data fit. Standardized
residuals take into account the sampling error associated with the
estimates of average performance at various ability levels. Raw
residuals do not. Accounting for the instability in the statistical
information seems important when assessing model-data fit.

The results of our work on the topic of hypothesis testing showed
clearly that, with the type of test items we worked with, failure to
consider variation in item discriminating power resulted in the
one-parameter model providing substantially poorer fits to the various
test data sets than the two- or three-parameter models. Also, examinee
guessing on difficult multiple-choice items affected the degree of
model-data fit. Here, substantial improvement in fit occurred when the
"pseudo-guessing" parameter was used in the item response model. These
results were not surprising given that the test items in the NAEP test
booklets varied considerably in their biserial correlations and a
substantial number of the multiple-choice items were difficult to
answer for low ability examinees. In summary, the results collected in
relation to the various hypotheses were invaluable for providing
insights about model-data fit.

Finally, it is our opinion that the results from this study will
be of interest and value to measurement specialists who are considering
the usefulness of item response models in their work. Since one cannot
assume that there is an adequate fit between a chosen model and a
particular data set, the goodness of fit issue must be addressed. The
analysis of residuals has been suggested as a method for determining
sufficient model-data fit. We believe the procedures and methods
suggested in this paper (including calculating average and
absolute-valued averages and plotting residuals versus ability) will
provide insights about the usefulness of the one-, two- and
three-parameter models, as well as many other item response models.